NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Does compressing activations help model parallel training?

Bian, Song; Li, Dacheng; Wang, Hongyi; Xing, Eric P; Venkataraman, Shivaram (May 2024, Seventh Annual Conference on Machine Learning and Systems)

Foundation models have superior performance across a wide array of machine learning tasks. The training of these models typically involves model parallelism (MP) to navigate the constraints of GPU memory capacity. However, MP strategies involve transmitting model activations between GPUs, which can hinder training speed in large clusters. Previous research has examined gradient compression in data-parallel contexts, but its applicability in MP settings remains largely unexplored. In this paper, we investigate the unique characteristics of compression in MP and study why strategies from gradient compression might not be directly applicable to MP scenarios. Subsequently, to systematically understand the capabilities and limitations of Model Parallelism Compression, we present a benchmarking framework MCBench. MCBench not only includes four major categories of compression algorithms but also includes several widely used models spanning language and vision tasks on a well-established distributed training framework, Megatron-LM. We initiate the first comprehensive empirical study by using MCBench. Our empirical study encompasses both the fine-tuning and pre-training of FMs. We probe over 200 unique training configurations and present results using 10 widely used datasets. To comprehend the scalability of compression advantages with the expansion of model size and cluster size, we propose a novel cost model designed specifically for training with MP compression. The insights derived from our findings can help direct the future development of new MP compression algorithms for distributed training. Our code is available at https://github.com/uw-mad-dash/MCBench
more » « less
Full Text Available
Contextualized: Heterogeneous Modeling Toolbox

https://doi.org/10.21105/joss.06469

Ellington, Caleb N; Lengerich, Benjamin J; Lo, Wesley; Alvarez, Aaron; Rubbi, Andrea; Kellis, Manolis; Xing, Eric P (May 2024, Journal of Open Source Software)

Full Text Available
Understanding Masked Autoencoders via Hierarchical Latent Variable Models

https://doi.org/10.1109/CVPR52729.2023.00765

Kong, Lingjing; Ma, Martin Q.; Chen, Guangyi; Xing, Eric P.; Chi, Yuejie; Morency, Louis-Philippe; Zhang, Kun (June 2023, IEEE)
The Two Dimensions of Worst-case Training and Their Integrated Effect for Out-of-domain Generalization

https://doi.org/10.1109/cvpr52688.2022.00941

Huang, Zeyi; Wang, Haohan; Huang, Dong; Lee, Yong Jae; Xing, Eric P. (June 2022, Conference on Computer Vision and Pattern Recognition (CVPR))

Training with an emphasis on “hard-to-learn” components of the data has been proven as an effective method to improve the generalization of machine learning models, especially in the settings where robustness (e.g., generalization across distributions) is valued. Existing literature discussing this “hard-to-learn” concept are mainly expanded either along the dimension of the samples or the dimension of the features. In this paper, we aim to introduce a simple view merging these two dimensions, leading to a new, simple yet effective, heuristic to train machine learning models by emphasizing the worst-cases on both the sample and the feature dimensions. We name our method W2D following the concept of “Worst-case along Two Dimensions”. We validate the idea and demonstrate its empirical strength over standard benchmarks.
more » « less
Full Text Available
Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features

Wang, Haohan; Huang, Zeyi; Zhang, Hanlin; Lee, Yong Jae; Xing, Eric P. (January 2022, Conference on Uncertainty in Artificial Intelligence (UAI))

Machine learning has demonstrated remarkable prediction accuracy over i.i.d data, but the accuracy often drops when tested with data from another distribution. In this paper, we aim to offer another view of this problem in a perspective assuming the reason behind this accuracy drop is the reliance of models on the features that are not aligned well with how a data annotator considers similar across these two datasets. We refer to these features as misaligned features. We extend the conventional generalization error bound to a new one for this setup with the knowledge of how the misaligned features are associated with the label. Our analysis offers a set of techniques for this problem, and these techniques are naturally linked to many previous methods in robust machine learning literature. We also compared the empirical strength of these methods demonstrated the performance when these previous techniques are combined, with implementation available here
more » « less
Full Text Available
Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

https://doi.org/10.1371/journal.pcbi.1008297

Zheng, Yumin; Wang, Haohan; Zhang, Yang; Gao, Xin; Xing, Eric P.; Xu, Min (November 2020, PLOS Computational Biology)
Zhang, Zhaolei (Ed.)
In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis -determinants of poly(A) signal (PAS) on the DNA sequence is the key to understand the mechanism of translation regulation and mRNA metabolism. Although machine learning methods were widely used in computationally identifying PAS, the need for tremendous amounts of annotation data hinder applications of existing methods in species without experimental data on PAS. Therefore, cross-species PAS identification, which enables the possibility to predict PAS from untrained species, naturally becomes a promising direction. In our works, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolution Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use four species and build cross-species training sets with two of them and evaluate the performance of the remaining ones. Moreover, we test our method against insufficient data and imbalanced data issues and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when it comes to a smaller or imbalanced training set.
more » « less
Full Text Available
Automating Dependence-Aware Parallelization of Machine Learning Training on Distributed Shared Memory

https://doi.org/10.1145/3302424.3303954

Wei, Jinliang; Gibson, Garth A.; Gibbons, Phillip B.; Xing, Eric P. (March 2019, EuroSys '19: Proceedings of the Fourteenth EuroSys Conference)

Machine learning (ML) training is commonly parallelized using data parallelism. A fundamental limitation of data parallelism is that conflicting (concurrent) parameter accesses during ML training usually diminishes or even negates the benefits provided by additional parallel compute resources. Although it is possible to avoid conflicting parameter accesses by carefully scheduling the computation, existing systems rely on programmer manual parallelization and it remains a question when such parallelization is possible. We present Orion, a system that automatically parallelizes serial imperative ML programs on distributed shared memory. The core of Orion is a static dependence analysis mechanism that determines when dependence-preserving parallelization is effective and maps a loop computation to an optimized distributed computation schedule. Our evaluation shows that for a number of ML applications, Orion can parallelize a serial program while preserving critical dependences and thus achieve a significantly faster convergence rate than data-parallel programs and a matching convergence rate and comparable computation throughput to state-of-the-art manual parallelizations including model-parallel programs.
more » « less
Full Text Available

Search for: All records